Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on QUERCUS. The teaching materials will consist of a Jupyter Notebook with concepts, comments, instructions, and blank spaces that you will fill out with Python code along with the instructor. Other teaching materials include an HTML version of the notebook, and datasets to import into Python - when required. This learning approach will allow you to spend the time coding and not taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark).
We'll take a blank-slate approach here to Python and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
You have a pile of data (like an Excel file or tab-separated file) full of experimental observations and you don't know what to do with it.
Maybe you're manipulating large tables all in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about Python and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
Welcome to this fifth lecture in a series of six. We've previously covered data structures, data wrangling, and flow control, but today we will dive further into one aspect of data wrangling: string manipulation and regular expressions.
At the end of this lecture we will aim to have covered the following topics:
grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Today's datasets will focus on using Python lists and the NumPy package
This is a subset of the taxa table we used back in lecture 03. We'll be using it to look at regular expressions with the Pandas package and DataFrames.
This is a small example dataset that we'll use to practice some string manipulation and DNA sequence formatting.
IPython's InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.
numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.
pandas provides the DataFrame class that allows us to format and play with data in a tabular format.
re provides regular expression matching functions that are similar to those found in the programming language Perl.
# ----- Always run this at the beginning of class so we can get multi-command output ----- #
# Access options from the iPython core
from IPython.core.interactiveshell import InteractiveShell
# Change the value of ast_node_interactivity
InteractiveShell.ast_node_interactivity = "all"
# ----- Additional packages we want to import for class ----- #
# Import the pandas package
import pandas as pd
import re
"A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as 'write only', because regular expressions are easier to write than to read/understand. And they are not particularly easy to write." - Jenny Bryan
RegEx is a very powerful and sophisticated way to perform string manipulation. Common uses of string manipulation are: searching, replacing or removing (making substitutions), and splitting and combining substrings.
So why do regular expressions or 'RegEx' get so much flak if it is so powerful for text matching? Scary example: how to verify an email address in different programming languages http://emailregex.com/.
Writing/reading RegEx is definitely one of those situations where you should annotate your code. There are many terrifying urban legends about people coming back to their code and having no idea what their code means.
For our first regex exercise, use Microsoft Word to open the file "regex_word.docx". This file contains one string: "Bob and Bobby went to Runnymede Road for a run and then went apple bobbing.". Here is what we are going to do:
ctrl+h to launch the find/replace function

re is Python's built-in module for regular expressions. The re module offers a set of functions that facilitate searching a string for a match. These functions include:
| Function | Description |
|---|---|
| findall | Returns a list containing all matches |
| finditer | Returns an iterator containing all matches |
| search | Returns a Match object if there is a match anywhere in the string |
| split | Returns a list where the string has been split at each match |
| sub | Replaces one or many matches with a string |
| escape | Escapes any special characters in a pattern that may be unavoidably present |
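Of these functions, finditer() and escape() won't come up again in this lecture, so here is a small sketch of how they behave (the strings are made up for illustration):

```python
import re

txt = "The rain in Spain"

# finditer() yields a Match object for each non-overlapping match,
# so unlike findall() we also get each match's position
for m in re.finditer("ain", txt):
    print(m.start(), m.group())

# escape() backslash-escapes any regex metacharacters in a literal string,
# which is handy when the search text comes from somewhere else (e.g. user input)
print(re.escape("1+2"))   # 1\+2
```
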
The re module uses a set of metacharacters and special sequences designed to match strings based on features such as only numbers, only non-numbers, only letters, etc. These are all contained in the next three tables.
Metacharacters are characters with a special meaning when interpreted as a regular expression:
| Character | Description | Example |
|---|---|---|
| [ ] | A set of characters | "[a-m]" |
| \ | Signals a special sequence (can also be used to escape special characters) | "\d" |
| . | Any character (except newline character) | "he..o" |
| ^ | The search string starts with | "^hello" |
| \$ | The search string ends with | "world\$" |
| * | Zero or more occurrences | "aix*" |
| + | One or more occurrences | "aix+" |
| { } | Exactly the specified number of occurrences | "al{2}" |
| {,n} | Between 0 and n repetitions | "al{,3}" |
| {m,n} | Between m and n repetitions | "al{3,10}" |
| {m,} | m or more repetitions | "al{3,}" |
| | | Either or | "falls|stays" |
| ( ) | Capture and group | "(exact*matches)" |
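To see the quantifiers from the table in action, here is a small sketch (the strings are made up for illustration):

```python
import re

# * matches zero or more of the preceding character, + matches one or more
print(re.findall("aix*", "ai aix aixxx"))   # ['ai', 'aix', 'aixxx']
print(re.findall("aix+", "ai aix aixxx"))   # ['aix', 'aixxx']

# {m,n} bounds the number of repetitions (greedy: takes as many as allowed)
print(re.findall("al{1,2}", "al all alll"))   # ['al', 'all', 'all']
```
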
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:
| Character | Description | Example |
|---|---|---|
| \A | Returns a match if the specified characters are at the beginning of the string | "\AThe" |
| \b | Returns a match where the specified characters are at the beginning or at the end of a word | r"\bain" r"ain\b" |
| \B | Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word | r"\Bain" r"ain\B" |
| \d | Returns a match where the string contains digits (numbers from 0-9) | "\d" |
| \D | Returns a match where the string DOES NOT contain digits | "\D" |
| \s | Returns a match where the string contains a white space character. This includes spaces, \t, \n, etc. | "\s" |
| \S | Returns a match where the string DOES NOT contain a white space character | "\S" |
| \w | Returns a match where the string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ character) | "\w" |
| \W | Returns a match where the string DOES NOT contain any word characters | "\W" |
| \Z | Returns a match if the specified characters are at the end of the string | "word\Z" |
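A quick sketch of the word-boundary sequences \b and \B from the table (using a made-up string):

```python
import re

txt = "The rain in Spain stays mainly in the plain"

# \b requires a word boundary next to the match; \B requires the opposite
print(re.findall(r"\bain", txt))   # [] - no word starts with "ain"
print(re.findall(r"ain\b", txt))   # "ain" at the end of rain, Spain, plain
print(re.findall(r"\Bain", txt))   # "ain" preceded by another word character
```
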
A set is a set of characters inside a pair of square brackets [ ] with a special meaning:
| Set | Description |
|---|---|
| [arn] | Returns a match where one of the specified characters (a, r, or n) are present |
| [a-n] | Returns a match for any lower case character, alphabetically between a and n |
| [^arn] | Returns a match for any character EXCEPT a, r, and n |
| [0123] | Returns a match where any of the specified digits (0, 1, 2, or 3) are present |
| [0-9] | Returns a match for any digit between 0 and 9 |
| [0-5][0-9] | Returns a match for any two-digit numbers from 00 and 59 |
| [a-zA-Z] | Returns a match for any character alphabetically between a and z, lower case OR upper case |
| [+] | In sets, +, *, ., |, (), \$, {} have no special meaning, so [+] means: return a match for any + character in the string |
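A quick sketch of a couple of these sets (the strings are made up for illustration):

```python
import re

# [0-5][0-9] matches two-digit numbers from 00 to 59 (e.g. minutes);
# note that "63" is not matched because 6 falls outside [0-5]
print(re.findall("[0-5][0-9]", "Run 07 started at 09:45, run 63 at 12:30"))

# Inside a set, + loses its special meaning
print(re.findall("[+]", "1 + 2 + 3"))   # ['+', '+']
```
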
We all have trouble with RegEx. Troubleshooting RegEx can take time, and sometimes you're better off working with a simulator that can help you out. Here are a couple of helpful sites where you can test your RegEx patterns:
https://regex101.com/
https://regexr.com/
Otherwise, know that you are NOT alone in your struggle!
re RegEx library functions
Time to explore re's functions (https://docs.python.org/3/library/re.html)
# Import the regular expression package
import re
findall() helps to pattern-match within a string
Sometimes we may want to find the occurrence of a specific pattern within a larger string. A good example is searching for sequence motifs in a larger block of genomic sequence. If we were working with a text editor like Microsoft Word we would use the Find tool and provide the pattern. Any matches would be listed somewhere for us to look at further.
In Python, the findall(pattern, string) function returns all non-overlapping matches of pattern within string as a list of strings. If the pattern has more than one capture group, this will instead return a list of tuples. Empty matches are also included in the result. Much like a text editor, findall() simply returns a list of the matches it finds, although this doesn't provide their positions within the string!
However, by providing regular expressions to the pattern parameters, we can identify more complex matching patterns, giving us more power and flexibility than a text editor. Let's start with a simple example.
# Print a list of all matches:
txt = "The rain in Spain"
x = re.findall("[rp]ain", txt)
print(x)
['rain', 'pain']
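As mentioned above, when the pattern contains more than one capture group, findall() returns a list of tuples, with one element per group. A small sketch:

```python
import re

txt = "The rain in Spain"

# Two capture groups: the first letter and the "ain" that follows it
x = re.findall("([rp])(ain)", txt)
print(x)   # [('r', 'ain'), ('p', 'ain')]
```
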
search() returns a Match object
As we see above we can get back the complex matches from a pattern but we aren't given their position within our string. If we are interested in where a pattern match first occurs in our search string - which is often what we want - then we can use the search() function.
The search(pattern, string) function searches the parameter string for a match to pattern, and returns a Match object if there is a match. If there is more than one match, only the first occurrence of the match will be returned.
Note that there is a similar function match() which only searches for a pattern starting at the first position of an input string. This also returns a Match object. We'll talk more about that soon.
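As a quick sketch of the difference between search() and match():

```python
import re

txt = "The rain in Spain"

# search() scans the whole string for the first match
print(re.search("rain", txt))   # a Match object spanning (4, 8)

# match() only tries the pattern at position 0, so this fails
print(re.match("rain", txt))    # None

# ...but it succeeds when the string actually starts with the pattern
print(re.match("The", txt))
```
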
Let's take a look at how search() works.
txt = "The rain in Spain"
x = re.search("[rp]ain", txt) # Both "rain" and "pain" (in Spain) match, but only the first is returned
# What result is returned
x
# What is its type?
type(x)
<re.Match object; span=(4, 8), match='rain'>
re.Match
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt) # Anything that starts with "The" and ends in "Spain"
x
<re.Match object; span=(0, 17), match='The rain in Spain'>
# Search for the first white-space character in the string:
txt = "The rain in Spain"
x = re.search(r"\s", txt)
print("The first white-space character is located in index position:", x.start())
The first white-space character is located in index position: 3
split() function
Most commonly you will encounter a dataset where columns of variables contain multiple data values concatenated by a character like :, ; or even \t. You may also want to break up sequence data based on specific patterns etc. The split(pattern, string) function returns a list where the string has been split at each match.
The optional parameter maxsplit takes a non-zero value to determine the maximum number of splits to produce, with the remaining unsplit string text being appended as the last element of the return list. The default value of maxsplit is 0, which means split at every match.
# Split at each white-space character:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
['The', 'rain', 'in', 'Spain']
Let's try an example using the maxsplit parameter.
# Split the string only at the first occurrence:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit = 1) # Number of strings = n+1, where n = number of splits. One split creates two strings, 2 splits create 3, and so on
print(x)
['The', 'rain in Spain']
sub() function
In some cases, we don't want to just split the information but would rather replace a pattern we are interested in with different text instead. We can use the sub() function, which takes the parameters:

pattern: our search pattern to replace.
repl: the string we will use to replace our pattern.
string: the search string we'll be providing as input.
count: a positive integer to determine the number of occurrences of pattern to replace. The default, 0, replaces all occurrences.

# Replace every white-space character with the number 9:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)
The9rain9in9Spain
Let's see what happens when we use the count parameter.
# Replace the first 2 occurrences:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count = 2)
print(x)
The9rain9in Spain
Part of normalizing text is removing special characters.
special_char = 'spec&al_character'
# Replace the & with an i
re.sub('&', 'i', special_char)
'special_character'
re.Match object
The re.Match object returned from search() and match() has information on the actual pattern match that has occurred in our search() call. It also has a number of methods and attributes that we'll find useful for accessing that information:

start() and end() methods return the indices of the start (inclusive) and end (exclusive) of the matched substring. You can also provide an integer or list of integers corresponding to capture groups from your pattern.
span() method returns a (start, end) tuple for your match
group() method returns the original matching text
string attribute returns the original search string
re attribute returns the regular expression object (which holds our pattern) used to produce this match

Let's return to one of our first examples.
txt = "The rain in Spain"
x = re.search("[rp]ain", txt) # Both "rain" and "pain" match, but only the first is returned
# Where do we get a match?
x.span()
# What was the matching text?
x.group()
# What was our original search string?
x.string
# What was our original pattern?
x.re
(4, 8)
'rain'
'The rain in Spain'
re.compile(r'[rp]ain', re.UNICODE)
Let's try something more complicated...
# Print the position (start- and end-position) of the first match occurrence.
# The regular expression looks for any words that starts with an upper case "S":
txt = "The rain in Spain"
x = re.search("\\bS\\w+", txt)
# What did we search for?
x.re
# Where is our match (if any) located?
x.span()
# What did we find?
x.group()
re.compile(r'\bS\w+', re.UNICODE)
(12, 17)
'Spain'
\ escape character with \ and r
A couple of notes about the code inside search() in the last block:
\ (backslash) is known as an "escape" character. It precedes special characters and "escapes" them, meaning that Python - or other programming languages for that matter - will know that those escaped characters are not to be interpreted literally. In other words, \$ tells Python that we are using the dollar sign for regex purposes to match the end of a string, and not as a dollar sign per se. Without the escape character, Python would treat it as a regular dollar sign character. The same principle applies to all special characters.
\bS Matches any word in the string that starts with a capital S. Note that \b denotes a word boundary.
\w Matches any word character in a string
Often you will need to "escape" the escape character because the Python interpreter also provides special meaning to the \ character. When reading through a string, the Python interpreter uses \ to help it identify special characters like \t (tab), \n (new line), or \' (treat as an apostrophe instead of end-quote).
Since the \ itself is a special character, and regex patterns are strings, then we need to alert Python to the fact that we are using the \ under a separate context as it interprets a string before passing it along to the regular expression interpreter.
When Python sees \b it will think: "Oh it's a \b. Someone really wants the \x08 character".
When Python sees \\b it will think: "Oh it's a \\. Someone just wants a regular \ character followed by a b. Pass along \b in the literal string."
# Here are some fun examples of what is happening inside text with and without extra escape characters
"this is the \b single-byte character. It happens with \a and \f as well but not \c or \w"
print("this is the \b single-byte character. It happens with \a and \f as well but not \c or \w")
# versus
"this is the \\b character. Here's \\a, \\f, \\c, and \\w"
print("this is the \\b character. Here's \\a, \\f, \\c, and \\w")
'this is the \x08 single-byte character. It happens with \x07 and \x0c as well but not \\c or \\w'
this is the single-byte character. It happens with and as well but not \c or \w
"this is the \\b character. Here's \\a, \\f, \\c, and \\w"
this is the \b character. Here's \a, \f, \c, and \w
As you can guess, it could be tedious to memorize when you actually need to escape your backslashes; likewise, it can be cumbersome to escape every escape character (as you should)! Python provides a convenient solution to this issue.
You can leave all the madness behind by beginning your regex sequences with r. This special tag preceding your regex string will tell Python to treat the string in its raw form without altering any of the backslashes.
In the following example, r keeps Python from reinterpreting the backslashes in \b and \w for us. Try removing r, or the first of the two consecutive \ in our first example, and see what kind of error you get. Python provides no hint about what the problem is.
# Convert your regular expression into raw text
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
# What did we search for?
x.re
# Where is our match (if any) located?
x.span()
# What did we find?
x.group()
re.compile(r'\bS\w+', re.UNICODE)
(12, 17)
'Spain'
At this point you should have a strong sense of how important it is to annotate your regex code; it can get obscure very quickly.
We've spent a lot of time working on regular expressions to search strings, but we haven't really touched much on the built-in string methods themselves. You are already familiar with several of them. Let's review what we already know about String objects.
+
Let's start with concatenation, which we have used before to print several strings together.
# Use a single + to concatenate
# Inside a print function
print('this is ' + 'a concatenated string')
# outside a print function
'this is ' + 'a concatenated string'
this is a concatenated string
'this is a concatenated string'
# It doesn't really matter if we include multiple concatenations to do the same thing
print('another' + ' ' + 'concatenated string')
another concatenated string
# The print function can automatically concatenate strings for us
# This only works inside print()
print('this is also', 'a concatenated string')
# This just makes a tuple
'this is also', 'a concatenated string'
this is also a concatenated string
('this is also', 'a concatenated string')
Notice that when using the comma syntax in the print() function, a trailing space inside the strings is not necessary. By default, the print() function uses a space to separate its inputs.
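That default separator can be changed with the sep parameter of print(). A quick sketch:

```python
# By default, print() separates its arguments with a single space
print('this is also', 'a concatenated string')

# The sep parameter lets us pick a different separator
print('this is also', 'a concatenated string', sep='_')
print('2024', '01', '15', sep='-')   # 2024-01-15
```
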
Remember that we can index strings much like we do arrays and lists using a 0-indexed positioning system.
string_1 = 'evolution'
string_1[4] # 0 indexing when going forward on the string
string_1[-5] # reverse indexing is one-based, meaning indices start at -1
'u'
'u'
string_1[4:7]
'uti'
Up until now, all of our slicing has been to retrieve all elements between two points. We can, however, apply a third index k to our String index which will return every kth element within our slice. A negative value of k will result in indexing in the reverse direction from the right end of the string.
This form of slicing is also known as striding. Let's see some examples.
# Index from 0 to 7 (exclusive) every two elements (it skips one every time)
string_1[0: 7: 2]
'eoui'
# Retrieve the reverse of our string
string_1[ : : -1]
# any clues where this type of string manipulation can be useful in biology?
'noitulove'
Palindromes are words that can be read in both directions without changing the meaning. Specific palindromic sequences are recognized as restriction sites by many endonucleases.
# Palindrome reversals
'Evil did I dwell. Lewd I did live'[: : -1]
'A man a plan a canal panama'[: : -1]
# Does this remind your of restriction sites?
'GAATTC'[ : : -1]
# Of course we'd still have to take the complement of this sequence
'evil did I dweL .llewd I did livE'
'amanap lanac a nalp a nam A'
'CTTAAG'
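That complement step can be combined with the reverse slice from above. Here is a minimal sketch using the built-in string methods maketrans() and translate(), which we haven't formally covered:

```python
# Translation table mapping each DNA base to its complement
complement = str.maketrans('ACGT', 'TGCA')

site = 'GAATTC'   # EcoRI recognition site

# Complement each base, then reverse with the [::-1] slice
rev_comp = site.translate(complement)[::-1]
print(rev_comp)           # GAATTC
print(site == rev_comp)   # True - the site is palindromic
```

A site like this is its own reverse complement, which is exactly what makes it palindromic in the biological sense.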
split() method
This looks a lot like the re.split() function, but the split() method breaks up your string from the left side using a constant separator. This results in faster processing as there is less overhead than in dealing with regular expressions. For simple and constant splits, use this method instead. Recall that this method also has a maxsplit parameter.
There is a second method rsplit() which performs like split() except it begins from the right side of the string object.
In the example below, string_1 will be split using the letter "o", which is removed from the string.
# split our string using the letter "o"
string_1.split(sep = "o", maxsplit = 1)
# maxsplit is the max number of splits; the result has at most maxsplit + 1 strings
['ev', 'lution']
# Now split from the right side
string_1.rsplit(sep = 'o', maxsplit = 1)
['evoluti', 'n']
join() method
We encountered this method briefly in the first lecture. Recall that the join() method takes all the items from an input iterable and joins them into a single string. This method is called from a string object which acts as the separator.
cagef_seq = ['CAGEF', 'Sequencing', 'Educational', 'Outreach'] # a list of strings
print('_'.join(cagef_seq)) # joins by underscore
type('_'.join(cagef_seq))
CAGEF_Sequencing_Educational_Outreach
str
splitlines() method to remove line boundaries and return a list
As we've seen in other lectures, the \n character is used to denote a new line in text. You can use the splitlines() method to break up a string by the \n character and return a list where each element is a line of text. The parameter keepends specifies if line breaks should be included as part of each line.
Note below that we've used the ''' triple quote as a way to make our string span multiple lines. This is purely for the purpose of making our code more readable and can also be used to make multi-line strings. This will, however, cause the inclusion of additional \n characters every time a new line is started.
othello = '''RODERIGO\nTush! never tell me; I take it much unkindly\nThat thou, Iago, who hast had my purse
As if the strings were thine, shouldst know of this.'''
othello
print()
othello.splitlines() # outputs a list of strings
print()
# analogous to
othello_list = othello.split('\n')
othello_list
'RODERIGO\nTush! never tell me; I take it much unkindly\nThat thou, Iago, who hast had my purse\nAs if the strings were thine, shouldst know of this.'
['RODERIGO', 'Tush! never tell me; I take it much unkindly', 'That thou, Iago, who hast had my purse', 'As if the strings were thine, shouldst know of this.']
['RODERIGO', 'Tush! never tell me; I take it much unkindly', 'That thou, Iago, who hast had my purse', 'As if the strings were thine, shouldst know of this.']
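We haven't used the keepends parameter above; here is a quick sketch of its effect using a made-up string:

```python
lines = "line one\nline two\nline three"

# keepends=True keeps the line-break character on each element
print(lines.splitlines(keepends=True))   # ['line one\n', 'line two\n', 'line three']
print(lines.splitlines())                # ['line one', 'line two', 'line three']
```
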
# There are many paths to the same place
# The key is knowing how to implement them properly
for i in othello_list:
# You could really do many things to put each element into its own list
substring_split = i.split("\n")
# substring_split = [i]
print(substring_split)
['RODERIGO']
['Tush! never tell me; I take it much unkindly']
['That thou, Iago, who hast had my purse']
['As if the strings were thine, shouldst know of this.']
Now, a regex example relevant to biologists: A string of DNA.
Have you seen a FASTA file before? They are the standard format for representing nucleotide and amino acid sequences using single-letter codes in text files, and look like the dino string below.
# notice the triple quotes: Use them to create multiline strings
dino = '''>DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200
GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCC
CCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGG
TGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGT
GCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGT
CACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATG
ATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGA
TCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAG
AGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTC
CCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTAT
CCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCC
ACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCAT
CGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'''
dino
'>DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 \nGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCC\nCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGG\nTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGT\nGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGT\nCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATG\nATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGA\nTCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAG\nAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTC\nCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTAT\nCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCC\nACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCAT\nCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
*** This piece of DNA is from the book Jurassic Park, and was supposed to be dinosaur DNA, but is actually just a cloning vector. Bummer.
This string is in FASTA format, but we don't need the header; we just want to deal with the DNA sequence. The header begins with '>' and ends with a number, '1200', with a space between the header and the sequence. Let's practice capturing each of these parts of a string, and then we'll make a raw regular expression to remove the entire header.
'>' is at the beginning of the string and occurs nowhere else, so we can simply remove it with the re.sub() function.
# Import our re library if it hasn't already been imported
import re
# Use the sub function to pull the ">" and replace it with nothing
re.sub(r'>', '', dino)
'DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 \nGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCC\nCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGG\nTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGT\nGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGT\nCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATG\nATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGA\nTCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAG\nAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTC\nCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTAT\nCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCC\nACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCAT\nCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
lstrip() and rstrip() to remove characters
In case you are only interested in removing characters at the beginning (first position) or end (last position) of a string, you can use the lstrip() (left) and rstrip() (right) methods respectively. Given a set of characters, they will continue to remove leading or trailing characters that match that set.
# Remove some leading characters from our string (the argument is a set of characters, not a prefix)
dino.lstrip('>Dino')
'NA from Crichton JURASSIC PARK p. 103 nt 1-1200 \nGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCC\nCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGG\nTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGT\nGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGT\nCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATG\nATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGA\nTCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAG\nAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTC\nCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTAT\nCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCC\nACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCAT\nCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
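Only lstrip() is demonstrated above; rstrip() works the same way from the right. A quick sketch with made-up strings:

```python
# rstrip() removes trailing characters that belong to the given set
print('ATGGCCTAANNNN'.rstrip('N'))   # ATGGCCTAA

# Careful: the argument is a set of characters, not a suffix
print('evolution'.rstrip('noi'))     # evolut
```
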
Next we can search for numbers. The expression [0-9] is looking for any digit. Always make sure to check that the pattern you are using gives you the output you expect.
We'll play with the sub() function to see how this works.
# Search for all digits 0-9 and replace with nothing
re.sub(r'[0-9]', '', dino) # r'\d' is equivalent; adding + would match whole runs of digits at once
'>DinoDNA from Crichton JURASSIC PARK p. nt - \nGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCC\nCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGG\nTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGT\nGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGT\nCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATG\nATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGA\nTCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAG\nAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTC\nCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTAT\nCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCC\nACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCAT\nCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
\s pattern

How do we capture spaces? The pattern \s denotes a whitespace character. However, so that the backslash is not interpreted as an escape character (its special function), we either need to add another backslash, making our pattern \\s, or use the raw string format (r'\s') for our regular expression.
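As a quick check (a sketch on a made-up string, not part of the lecture dataset), both spellings describe the same two-character regex:

```python
import re

# '\\s' (doubled backslash) and r'\s' (raw string) spell the same pattern
text = 'ACGT ACGT'
escaped = re.findall('\\s', text)
raw = re.findall(r'\s', text)
print(escaped, raw)  # both find the single space: [' '] [' ']
```

Raw strings are usually the cleaner choice for regex patterns, since you can write the pattern exactly as the regex engine will see it.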
We'll use the string replace() method to replace these with a blank character (or nothing) but this method does not accept regular expressions as input so we must use ' ' as input instead.
# Will this replace all whitespace characters?
dino.replace(' ', '')
'>DinoDNAfromCrichtonJURASSICPARKp.103nt1-1200\nGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCC\nCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGG\nTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGT\nGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGT\nCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATG\nATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGA\nTCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAG\nAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTC\nCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTAT\nCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCC\nACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCAT\nCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
sub()

As you can see from above, our replace() call only replaced actual spaces and did not replace any of the newline characters - which we also want to fix. Unlike the replace() method, the re.sub() function interprets \s as any whitespace character! Let's see it in action!
# Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
re.sub(r'\s', '', dino)
'>DinoDNAfromCrichtonJURASSICPARKp.103nt1-1200GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
To remove the entire header, we need to combine the patterns we've tested. The header is everything in between '>' and the number '1200' followed by a space. Recall:
. captures any single character
* is any number of times (including zero)
^ denotes the pattern match must begin at the start of the search string
# Use findall to confirm what our regex is finding
re.findall(r'^>.*[0-9]\s|\n', dino)
# | (or) statement needed to remove all line breaks, otherwise only the first one is removed
# Substitute that into sub()
re.sub(r'^>.*[0-9]\s|\n', '', dino) # ^ matches the beginning of the line
['>DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 ', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n']
'GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
Here's our regex pattern: ^>.*[0-9]\s|\n.
Which retrieves: ">DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 ".
Why doesn't it give us: ">DinoDNA from Crichton JURASSIC PARK p. 103 "?
Let's break it down. In our above code we generate the regex ^>.*[0-9]\s|\n, which includes the * qualifier. Recall from our table that the qualifiers *, +, {m,n}, and ? allow us to search for pattern matches in a range from (0, infinity), (1, infinity), (m, n) and (0, 1) respectively. These qualifiers are implemented in a greedy fashion, meaning they will continue matching as many characters as possible within the range.
We can, however, choose a lazy/non-greedy/minimal matching approach where these qualifiers match as few characters as possible. To switch our qualifiers to lazy matching we add another ?. Yes, that's right: the ? metacharacter has yet another meaning in *?, +?, {m,n}? and ??! The second ? signals to the regex engine that you'd like those qualifiers to use lazy matching!
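Before applying this to our header, here is a minimal illustration of the difference (using a made-up tagged string, not our DNA data):

```python
import re

tagged = '<b>bold</b>'
greedy = re.findall(r'<.*>', tagged)   # .* grabs as much as it can
lazy = re.findall(r'<.*?>', tagged)    # .*? stops at the first possible '>'
print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```

The greedy version swallows everything between the first '<' and the last '>', while the lazy version produces the two shortest matches.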
Let's see what happens if we update our regex pattern.
# Use findall with lazy matching
re.findall(r'^>.*?[0-9]\s|\n', dino)
['>DinoDNA from Crichton JURASSIC PARK p. 103 ', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n']
Just as we predicted, it made the shorter match instead! So remember that the default of these qualifiers is to produce greedy matching!
In our case, of course, we want greedy matching to replace the entire header. Let’s save the dna into its own object.
# Save our replaced result to a variable: dna
dna = re.sub(r'^>.*[0-9]\s|\n', '', dino) # raw string avoids treating \s as a string escape
dna
'GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCTGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAAGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAGATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACTCCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCTGGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCCGCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGGCCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGGCCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT'
We may also want to retain our header in a separate string rather than just removing it. In that case, we want to use a function like search(), which returns a match object for the string that matches our pattern rather than removing it. We can save this in an object called header and then access the matched text by calling the group() method.
# Use search. Remember it only returns the first result!
header = re.search(r'^>.*[0-9]\s', dino)
# Retrieve the matching pattern
header = header.group()
header
'>DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 '
Now that we understand greedy matching we can also introduce the idea of capture groups. In our above Regex, we know now that from ^>.*[0-9]\s the .* is being matched to DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-120 because the . will accept any character value. You may be skeptical but we can prove this with the capture group!
The capture group is denoted by parentheses ( ) in our regex patterns and can be used to capture subgroups from our pattern. This can be useful if you want to re-insert the information later or break it into multiple columns.
To access capture groups from a match object, use the group(index) method with an index value where 0 holds the entire match, and each capture group is at increasing indices. You can also access all groups as a tuple with the groups() method.
You can even name your capture groups in your regular expression! Let's keep it simple for now.
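As a quick aside (a sketch using a hypothetical pattern, not part of the lecture flow), named groups use the (?P&lt;name&gt;...) syntax and can then be retrieved by name instead of by index:

```python
import re

header = '>DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 '
# (?P<page>...) names the capture group 'page'
m = re.search(r'p\. (?P<page>[0-9]+)', header)
print(m.group('page'))   # '103'
print(m.groupdict())     # {'page': '103'}
```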
header_capture = re.search(r'(^>)(.*)([0-9]\s)', dino)
# You can access all capture groups as a tuple (an iterable!)
header_capture.groups()
('>', 'DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-120', '0 ')
Now we can look for patterns in our (dino) DNA! Does this DNA have balanced GC content? We can use the re.findall() function to capture every character that is either a G or a C. We'll start by just viewing the first few hits of our search.
re.findall(r'G|C', dna)[0:6] # just the first 6 hits for simplicity
['G', 'C', 'G', 'G', 'C', 'G']
# How many hits are there vs overall length?
(len(re.findall(r'G|C', dna)) / len(dna)) * 100 # GC percent
60.0
Let's translate our dinosaur DNA into mRNA! There are two different ways we can explore to replace multiple patterns at once.
First, we can use the string replace() method through method-chaining on the different bases to replace. Note that in our example we need to call on replace() five times! The extra replace() call allows us to initially create a placeholder symbol for transcribing G to C. Otherwise when we replace C with G in the second replace() call, we'll also change the bases that we just altered.
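To see why the placeholder matters, try a two-base toy example without it:

```python
# without a placeholder, the second replace() clobbers the first one's output
naive = 'GC'.replace('G', 'C').replace('C', 'G')
print(naive)  # 'GG' -- both bases end up as G, not the complement 'CG'
```

Every G first becomes a C, and then the second call turns all C's (including the new ones) into G's. Routing G through a placeholder symbol avoids this.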
# Use method-chaining to convert our sequence to mRNA
dna.replace('G', 'X').replace('C', 'G').replace('A', 'U').replace('T', 'A').replace('X', 'C')
# replace() calls run sequentially, so without the placeholder, characters replaced earlier would be replaced again
'CGCAACGACCGCAAAAAGGUAUCCGAGGCGGGGGGACUGCUCGUAGUGUUUUUAGCUGCGCCACCGCUUUGGGCUGUCCUGAUAUUUCUAUGGUCCGCAAAGGGGGACCUUCGAGGGAGCACAAGGCUGGGACGGCGAAUGGCCUAUGGACAGGCGGAAAGAGGGAAGCCCUUCGCACCGACGAGUGCGACAUGGAUAGAGUCAAGCCACAUCCAGCAAGCGAGGUUCGACCCGACACACGGCAAGUCGGGCUGGCGACGCGGAAUAGGCCAUUGAUAGCAGAACUCAGGUUGGGCCAUUUCAUCCUGUCCACGGCCGUCGCGAGACCCAGUAAAAGCCGCUCCUGGCGAAAGCGACCUCUAGCCGGACAGCGAACGCCAUAAGCCUUAGAACGUGCGGGAGCGAGUUCGGAAGCAGUGAGGUUUGCAAAGCCGCUCUUCGUCCGGUAAUAGCGGCCGUACCGCCGGCUGCGCGACCCGACCGCAAGCGCUGCGCUCCGACCUACCGGAAGGGGUAAUACUAAGAAGAGCGAAGGCCGCCGGGCGCAACGUCCGGUACGACAGGUCCGUCCAUCUACUGCUGGUAGUCCCUGUCGAAGUUGCCGAGAAUGGUCGGAUUGAAGCUAGUGACCUGGCGACUAGCAGUGCCGCUAAAUACGGCGUGUACCUGCGCAACGACCGCAAAAAGGUAUCCGAGGCGGGGGGACUGCUCGUAGUGUUUGUUCAGUCUCCACCGCUUUGGGCUGUCCUGAUAUUUCUAUGGUCCGCAAAGGGGGACCUUCGCGAGAGGACAAGGCUGGGACGGCGAAUGGCCUAUGGACAGGCGGAAAGAGGGAAGCCCGAAAGAGUUACGAGUGCGACAUCCAUAGAGUCAAGCCACAUCCAGCAAGCGAGGUUCGACUGCUUGGGGGGCAAGUCGGGCUGGCGACGCGGAAUAGGCCAUUGAUAGCAGAACUCAGGUUGUGCUGAAUUGCCCAACCGUACCUAACAUCCGCGGCGGGAUAUGGAACAGACGGAGGGGCGCCACGUACCUCGGCCCGGUGGAGCUGGACUUACCUUCGGCCGCCGUGGAGCGAUUGCCGGUUCUUAACCUCGGUUAGUUAAGAACGCCUCUUGACACUUACGCGUUUGGUUGGGAACCGGUAGCGCAGGCGGUAGAGGUCGUCGGCGUGCGCCGCGUAGAGCCCGUCGCAACCCAGGA'
lambda function to help replace multiple patterns/values

A lambda function is a small anonymous function that can take any number of arguments but can only have one expression. It takes the form lambda arguments: expression. This syntax allows a developer to quickly generate a function that might otherwise require many lines of code to implement (see upcoming Lecture 06).
The lambda function can only execute expressions. That is, its body must evaluate to a single value, whereas statements, such as a variable assignment with =, are not allowed inside a lambda.
Let's run through a quick example.
# We can make a function that takes three variables
# And returns a single value
test_func = lambda x, y, z: (x+y)*z
# Give it a try on our named function test_func()
test_func(2, 4, 6)
36
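A common everyday use of lambdas (a small sketch, separate from our DNA example) is as a throwaway key function:

```python
# sort (base, percent) pairs by the percent rather than the base name
pairs = [('G', 30.0), ('C', 42.5), ('A', 25.0)]
by_percent = sorted(pairs, key=lambda p: p[1])
print(by_percent)  # [('A', 25.0), ('G', 30.0), ('C', 42.5)]
```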
Seems pretty straightforward, right? Let's take it to the next level.
sub() in conjunction with a lambda function

We can use the re.sub() function in combination with a lambda function. When sub() finds a match, it will generate a re.Match object and pass it into our lambda function. Within our lambda function we'll use a dictionary object. Note that the dictionary will have access to all of its regular methods and attributes as well.
In our lambda function we will write a dictionary to hold all of the find:replace pairs as key:value pairs. Recall the dictionary method get(key, alt_value) which will return the value associated with key, otherwise it returns alt_value if the key is not found.
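A quick refresher on get() before we combine everything:

```python
# get() returns the stored value when the key exists, and the fallback when it doesn't
base_pairs = {'G': 'C', 'C': 'G', 'A': 'U', 'T': 'A'}
print(base_pairs.get('G', 'X'))  # 'C'  (key found)
print(base_pairs.get('N', 'X'))  # 'X'  (key missing, fallback returned)
```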
With this method there will be no need to generate a placeholder value like before, and it will be much cleaner to alter code later if needed.
mrna = re.sub('.', # Find any character
lambda m: {'G':'C', 'C':'G', 'A':'U', 'T':'A'}.get(m.group(), "X"), # Search for the character as a key
dna # Use 'dna' as the initial search string
) # this does what we wanted
# Show us the updated result
mrna
'CGCAACGACCGCAAAAAGGUAUCCGAGGCGGGGGGACUGCUCGUAGUGUUUUUAGCUGCGCCACCGCUUUGGGCUGUCCUGAUAUUUCUAUGGUCCGCAAAGGGGGACCUUCGAGGGAGCACAAGGCUGGGACGGCGAAUGGCCUAUGGACAGGCGGAAAGAGGGAAGCCCUUCGCACCGACGAGUGCGACAUGGAUAGAGUCAAGCCACAUCCAGCAAGCGAGGUUCGACCCGACACACGGCAAGUCGGGCUGGCGACGCGGAAUAGGCCAUUGAUAGCAGAACUCAGGUUGGGCCAUUUCAUCCUGUCCACGGCCGUCGCGAGACCCAGUAAAAGCCGCUCCUGGCGAAAGCGACCUCUAGCCGGACAGCGAACGCCAUAAGCCUUAGAACGUGCGGGAGCGAGUUCGGAAGCAGUGAGGUUUGCAAAGCCGCUCUUCGUCCGGUAAUAGCGGCCGUACCGCCGGCUGCGCGACCCGACCGCAAGCGCUGCGCUCCGACCUACCGGAAGGGGUAAUACUAAGAAGAGCGAAGGCCGCCGGGCGCAACGUCCGGUACGACAGGUCCGUCCAUCUACUGCUGGUAGUCCCUGUCGAAGUUGCCGAGAAUGGUCGGAUUGAAGCUAGUGACCUGGCGACUAGCAGUGCCGCUAAAUACGGCGUGUACCUGCGCAACGACCGCAAAAAGGUAUCCGAGGCGGGGGGACUGCUCGUAGUGUUUGUUCAGUCUCCACCGCUUUGGGCUGUCCUGAUAUUUCUAUGGUCCGCAAAGGGGGACCUUCGCGAGAGGACAAGGCUGGGACGGCGAAUGGCCUAUGGACAGGCGGAAAGAGGGAAGCCCGAAAGAGUUACGAGUGCGACAUCCAUAGAGUCAAGCCACAUCCAGCAAGCGAGGUUCGACUGCUUGGGGGGCAAGUCGGGCUGGCGACGCGGAAUAGGCCAUUGAUAGCAGAACUCAGGUUGUGCUGAAUUGCCCAACCGUACCUAACAUCCGCGGCGGGAUAUGGAACAGACGGAGGGGCGCCACGUACCUCGGCCCGGUGGAGCUGGACUUACCUUCGGCCGCCGUGGAGCGAUUGCCGGUUCUUAACCUCGGUUAGUUAAGAACGCCUCUUGACACUUACGCGUUUGGUUGGGAACCGGUAGCGCAGGCGGUAGAGGUCGUCGGCGUGCGCCGCGUAGAGCCCGUCGCAACCCAGGA'
Is there even a start codon in this thing? Let's use the re.search() function to check.
re.search('AUG', mrna) # reports only the first hit; the match spans positions 89 to 92 (end non-inclusive)
<re.Match object; span=(89, 92), match='AUG'>
# corroborate the findings
mrna[89:92]
'AUG'
findall() to look for multiple hits

It might be more useful to know exactly how many possible start codons we have. len(re.findall()) counts the number of matches in the string for our pattern.
# Use the len() function with findall() to count instances of pattern matches
len(re.findall('AUG', mrna))
9
finditer()

So we know there are 9 hits somewhere in our string but we don't know exactly where. A quick way to locate our hits is the re.finditer() function, which returns re.Match objects as an iterator. Remember iterators from last week? How about list comprehension?
We'll use both in the next example to generate a list of start and end positions for all hits on AUG.
# Wrap our list comprehension in [] or list()
[(m.start(0), m.end(0)) for m in re.finditer('AUG', mrna)] # each m is a re.Match object
# Or cast it as a list with the span(). Same result
list((m.span(0)) for m in re.finditer('AUG', mrna))
[(89, 92), (138, 141), (145, 148), (191, 194), (607, 610), (758, 761), (807, 810), (814, 817), (1002, 1005)]
[(89, 92), (138, 141), (145, 148), (191, 194), (607, 610), (758, 761), (807, 810), (814, 817), (1002, 1005)]
range() call

Let's split this string into codon substrings. Now that we've reviewed list comprehension, we'll use it in conjunction with string slicing, which we discussed earlier. To accomplish that we'll use the range() function to make an iterator. Recall that we can include a step argument $k$ in our call to range() to produce every $k^{th}$ value in our range.
Recall that our start codon is located at position 89.
# Recall how range() works
# We'll unpack the iterator values with * inside our print() call
print(*range(0, 10, 3))
0 3 6 9
# We want a range that goes up to the 3rd last base in our mRNA sequence
codon_list = [mrna[i:i+3] for i in range(89, len(mrna)-2, 3)]
mrna[89:119] # first 30 bases of mrna
codon_list[0:10] # first 10 codons
'AUGGUCCGCAAAGGGGGACCUUCGAGGGAG'
['AUG', 'GUC', 'CGC', 'AAA', 'GGG', 'GGA', 'CCU', 'UCG', 'AGG', 'GAG']
How many times do we see a stop codon in our codon list? We can accomplish a count also using list comprehension and the keyword in.
# How many times do we see a stop codon in our codon list?
sum( # We can sum up a boolean list to get how many True values are present
(x in ["UAG","UGA","UAA"] # Use our iterator and check if it is in the list
for x in codon_list) # Generate an iterator from each element of the list
)
13
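This works because Python treats True as 1 and False as 0 when summing:

```python
# each True contributes 1 to the total, so summing booleans counts the True values
hits = [True, False, True, True]
print(sum(hits))  # 3
```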
Of course we can convert the list back into a string using the join() method. Let's put a delimiter between them so we can still track individual codons.
# Use the join() method to put our codons together
codon_str = '.'.join(codon_list)
codon_str[:40]
'AUG.GUC.CGC.AAA.GGG.GGA.CCU.UCG.AGG.GAG.'
Now our codons are stored and delimited in codon_str. Do we have a stop codon anywhere in our reading frame? Let's check with re.findall()
# Remember what findall() does?
re.findall(r'UAG|UGA|UAA', codon_str)
['UAA', 'UAA', 'UAG', 'UGA', 'UAG', 'UGA', 'UAG', 'UAA', 'UGA', 'UAA', 'UAG', 'UAG', 'UAG']
How many stop codons are there in codon_str?
len(re.findall(r'UAG|UGA|UAA', codon_str))
13
Where are the stop codons in codon_str located?
list((m.start(0), m.end(0), m.group()) for m in re.finditer(r'UAG|UGA|UAA', codon_str))
[(388, 391, 'UAA'), (476, 479, 'UAA'), (480, 483, 'UAG'), (704, 707, 'UGA'), (712, 715, 'UAG'), (716, 719, 'UGA'), (732, 735, 'UAG'), (748, 751, 'UAA'), (1168, 1171, 'UGA'), (1348, 1351, 'UAA'), (1404, 1407, 'UAG'), (1420, 1423, 'UAG'), (1452, 1455, 'UAG')]
So our findall() results from codon_str match up with our list comprehension search using codon_list. That's great!
Splitting codon_str by stop codon sequence with split()

Let's subset codons based on stop codons. This will create 14 genetic sequences (remember that we have 13 stop codons). We can use the re.split() function to accomplish the task. When we print our results we'll also remove the extra . delimiter we've inserted between codons.
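On a toy string (invented for illustration), re.split() cuts at every occurrence of any alternative and discards the delimiters:

```python
import re

# every UAG or UGA occurrence acts as a delimiter and is dropped from the output
pieces = re.split(r'UAG|UGA', 'AAAUAGCCCUGAGGG')
print(pieces)  # ['AAA', 'CCC', 'GGG']
```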
# split using multiple delimiters (the stop codons in this case), and save the pieces as a list
translation = re.split('UAG|UGA|UAA', codon_str) # splits at every occurrence of any stop codon
print('Number of strings after splitting: ' + str(len(translation))) # 14 strings after splitting with stop codons
print(type(translation)) # Translation is a list
counter = 1
# Print each string produced by re.split()
for i in translation:
# Print our split strings and remove the '.' between all codons
print('split ' + str(counter) + ':' + re.sub(r'\.', '', i) + '\n')
counter += 1
Number of strings after splitting: 14
<class 'list'>
split 1:AUGGUCCGCAAAGGGGGACCUUCGAGGGAGCACAAGGCUGGGACGGCGAAUGGCCUAUGGACAGGCGGAAAGAGGGAAGCCCUUCGCACCGACGAGUGCGACAUGGAUAGAGUCAAGCCACAUCCAGCAAGCGAGGUUCGACCCGACACACGGCAAGUCGGGCUGGCGACGCGGAAUAGGCCAUUGAUAGCAGAACUCAGGUUGGGCCAUUUCAUCCUGUCCACGGCCGUCGCGAGACCCAGUAAAAGCCGCUCCUGGCGAAAGCGACCUCUAGCCGGACAGCGAACGCCA

split 2:GCCUUAGAACGUGCGGGAGCGAGUUCGGAAGCAGUGAGGUUUGCAAAGCCGCUCUUCGUCCGG

split 3:

split 4:CGGCCGUACCGCCGGCUGCGCGACCCGACCGCAAGCGCUGCGCUCCGACCUACCGGAAGGGGUAAUACUAAGAAGAGCGAAGGCCGCCGGGCGCAACGUCCGGUACGACAGGUCCGUCCAUCUACUGCUGGUAGUCCCUGUCGAAGUUGCCGAGAAUGGUCGGAU

split 5:AGC

split 6:

split 7:CCUGGCGAC

split 8:CAGUGCCGC

split 9:AUACGGCGUGUACCUGCGCAACGACCGCAAAAAGGUAUCCGAGGCGGGGGGACUGCUCGUAGUGUUUGUUCAGUCUCCACCGCUUUGGGCUGUCCUGAUAUUUCUAUGGUCCGCAAAGGGGGACCUUCGCGAGAGGACAAGGCUGGGACGGCGAAUGGCCUAUGGACAGGCGGAAAGAGGGAAGCCCGAAAGAGUUACGAGUGCGACAUCCAUAGAGUCAAGCCACAUCCAGCAAGCGAGGUUCGACUGCUUGGGGGGCAAGUCGGGCUGGCGACGCGGAAUAGGCCAUUGAUAGCAGAACUCAGGUUGUGC

split 10:AUUGCCCAACCGUACCUAACAUCCGCGGCGGGAUAUGGAACAGACGGAGGGGCGCCACGUACCUCGGCCCGGUGGAGCUGGACUUACCUUCGGCCGCCGUGGAGCGAUUGCCGGUUCUUAACCUCGGUUAGU

split 11:GAACGCCUCUUGACACUUACGCGUUUGGUUGGGAACCGG

split 12:CGCAGGCGG

split 13:AGGUCGUCGGCGUGCGCCGCG

split 14:AGCCCGUCGCAACCCAGG
Let's go back to our codon_str string and translate its codons into protein. First, we need a dictionary where the keys are mRNA codons and the values are amino acid codes.
# # DNA to Protein
# {'TTT':'F', 'TTC':'F', # Phenylalanine
# 'TTA':'L', 'TTG':'L', 'CTT':'L', 'CTC':'L', 'CTA':'L', 'CTG':'L', # Leucine
# 'ATT':'I', 'ATC':'I', 'ATA':'I', # Isoleucine
# 'ATG':'M', # Methionine
# 'GTT':'V' , 'GTC':'V', 'GTA':'V', 'GTG':'V', # Valine
# 'TCT':'S', 'TCC':'S', 'TCA':'S', 'TCG':'S', # Serine
# 'CCT':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', # Proline
# 'ACT':'T', 'ACC':'T', 'ACA':'T', 'ACG':'T', # Threonine
# 'GCT':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A', # Alanine
# 'TAT':'Y', 'TAC':'Y', # Tyrosine
# 'TAA':'', 'TAG':'', 'TGA':'',# Stop
# 'CAT':'H', 'CAC':'H', # Histidine
# 'CAA':'Q', 'CAG':'Q', # Glutamine
# 'AAT':'N', 'AAC':'N', # Asparagine
# 'AAA':'K', 'AAG':'K', # Lysine
# 'GAT':'D', 'GAC' :'D', # Aspartic acid
# 'GAA':'E', 'GAG':'E', # Glutamic acid
# 'TGT':'C', 'TGC':'C', # Cysteine
# 'TGG':'W', # Tryptophan
# 'CGT':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R', 'AGA':'R', 'AGG':'R', # Arginine
# 'AGT':'S', 'AGC':'S', # Serine
# 'GGT':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G' # Glycine
# }
# mRNA to Protein
translation_aminoacids = {
'UUU':'F', 'UUC':'F', # Phenylalanine
'UUA':'L', 'UUG':'L', 'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L', # Leucine
'AUU':'I', 'AUC':'I', 'AUA':'I', # Isoleucine
'AUG':'M', # Methionine
'GUU':'V' , 'GUC':'V', 'GUA':'V', 'GUG':'V', # Valine
'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S', # Serine
'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', # Proline
'ACU':'U', 'ACC':'U', 'ACA':'U', 'ACG':'U', # Threonine (note: the standard one-letter code is 'T'; 'U' is used here)
'GCU':'A', 'GCA':'A', 'GCG':'A', # Alanine (note: 'GCC' is absent, so GCC codons fall back to 'X' below)
'UAU':'Y', 'UAC':'Y', # Tyrosine
'UAA':'', 'UAG':'', 'UGA':'', # Stop. We will translate stop codons into nothing for this exercise
'AUG':'START', # Start (this repeats the 'AUG' key above; the later entry wins, so AUG renders as 'START')
'CAU':'H', 'CAC':'H', # Histidine
'CAA':'Q', 'CAG':'Q', # Glutamine
'AAU':'N', 'AAC':'N', # Asparagine
'AAA':'K', 'AAG':'K', # Lysine
'GAU':'D', 'GAC':'D', # Aspartic acid
'GAA':'E', 'GAG':'E', # Glutamic acid
'UGU':'C', 'UGC':'C', # Cysteine
'UGG':'W', # Tryptophan
'CGU':'R', 'CGC':'R', 'CGA':'R', 'CGG':'R', 'AGA':'R', 'AGG':'R', # Arginine
'AGU':'S', 'AGC':'S', # Serine
'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G' # Glycine
}
Let's do a mixture of lambda functions and list comprehension again to accomplish our goal on the codon_str object. You'll notice that we don't really need to employ any regular expressions now that we've formatted our data.
# Clean up our codon sequence and remove the additional "." characters
codon_str_clean = re.sub(r'\.', '', codon_str)
# Assign a lambda function (to keep the code clean)
translate = lambda loc, seq, codon_dict: codon_dict.get(seq[loc:loc+3], "X")
# Here's how it works
translate(3, codon_str_clean, translation_aminoacids)
'V'
# Generate a range that loops through codon positions, translates them, and then joins it all together
# The list comprehension will return a list
''.join([translate(x, codon_str_clean, translation_aminoacids) for x in range(0, len(codon_str_clean)-2, 3)])
'STARTVRKGGPSREHKAGUANGLWUGGKREXLRUDECDSTARTDRVKPHPASEVRPDURQVGLAURNRPLIAELRLGHFILSUXVARPSKSRSWRKRPLXGQRUPXLERAGASSEAVRFAKPLFVRRPYRRLRDPUASAALRPUGRGNUKKSEGRRAQRPVRQVRPSUAGSPCRSCREWSDSPGDQCRIRRVPAQRPQKGIRGGGUARSVCSVSUALGCPDISSTARTVRKGGPSREDKAGUANGLWUGGKREXRKSYECDIHRVKPHPASEVRLLGGQVGLAURNRPLIAELRLCIXQPYLUSAAGYGUDGGAPRUSXRWSWUYLRPPWSDCRFLUSVSERLLULURLVGNRRRRRSSACXASPSQPR'
# Or you could work with codon_list and iterate through it directly
''.join([translation_aminoacids.get(x, "X") for x in codon_list])
And now we have a protein sequence!
Let's return to the human_microbiome_project_otu_taxa_table_subset.csv file from lecture 03. Recall that the dataset had 7 rows, of which row 0 was a semicolon-delimited version of the other 6 rows of data. This time around we'll do the following:
retain row 0 and drop the rest
melt the table so each OTU becomes an observation
split the taxonomy strings on the ; delimiter
remove the x__ prefix that occurs as part of each entry
Let's give it a try!
# Import our two packages
import pandas as pd
import numpy as np
# Read in human_microbiome_project_otu_taxa_table_subset.csv
# fill empty cells with NaN -default behavior
data = pd.read_csv('data/human_microbiome_project_otu_taxa_table_subset.csv', na_values=(''))
data.head()
To retain row 0 we'll simply pull the row and re-assign it to the data variable.
# Remember how to access a specific row?
data = data.loc[...]
data
drop() method

Recall that we can remove columns with the drop() method. In this case, we don't need to retain the information in column 0, so we'll remove it. We want to do this before we melt the data, otherwise the column will get copied for each new melted observation.
# Drop the first column
data.drop('Unnamed: 0', axis = 1, inplace = ...) # remove the column with that name from axis 1
data.head()
Recall that in the process of melting we will convert all of the column names to a single column with one row per column name - these become our observations. All of the values in each column will be relocated as values for the appropriate observations in the melted table.
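Here is a tiny hand-made example of what melt() does (the column names OTU_1/OTU_2 and values are invented for illustration):

```python
import pandas as pd

# a hypothetical "wide" table with one column per OTU
wide = pd.DataFrame({'OTU_1': [3, 0], 'OTU_2': [1, 5]})
long = pd.melt(wide, var_name='OTU')
print(long)
# the column names repeat as labels in 'OTU', and their cell contents stack in 'value'
```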
data_melt = pd.melt(data,
var_name = 'OTU', # Name to use for the 'variable' column (a column containing old column names)
value_vars = ...) # Column(s) to unpivot (the list of columns that you want to melt)
data_melt.head()
Splitting on ; delimiters

Now that we have the table melted, we need to split the value column based on the sequence of text. Let's take a look at an example of the text:
Root;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus
From the data you can see that there is a bit of a pattern emerging. We'll have to make the assumption that if any information is missing, it comes later in the hierarchy sequence for each entry.
We'll take the obvious route and split on the ; first using the pandas.Series.str.split() method. This method splits strings around a given separator/delimiter just like the str.split() method. However, it takes the following parameters:
pat: a string or regular expression
n: limit the number of splits; None, 0, and -1 (default) will return all splits
expand: a boolean where True returns a DataFrame with expanded dimensionality (ie extra columns) and False returns a Series/Index containing lists of strings
If, in the course of splitting, the number of splits doesn't match the shape of the current output DataFrame, it pads the missing values with the None value.
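A small demonstration of that padding behavior with uneven splits (taxonomy strings shortened and invented for illustration):

```python
import pandas as pd

# the second entry splits into fewer pieces than the first
taxa = pd.Series(['Root;p__Firmicutes;c__Bacilli',
                  'Root;p__Proteobacteria'])
split_df = taxa.str.split(';', expand=True)
print(split_df)  # the short entry is padded with a missing value in its last column
```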
# data_melt[data_melt['value'].str.split('\;', expand=True)]['OTU'] # next, rename columns
# Take the melted data and split on the value column
data_melt_split = data_melt['value'].str.split(..., expand=True) # next, rename columns
# Take a peek at the result
data_melt_split.head()
None values using the fillna() method

Looks like we did generate some None values from our split call. We can replace those with NaN values using the fillna() method. This method will recognize the Python object None and replace it with whatever value you want. In this case, we'll use the numpy NaN object.
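fillna() accepts any replacement value; with a visible placeholder (chosen only for this sketch) the effect is easy to see:

```python
import pandas as pd

# a throwaway frame with one genuine value and one None
df = pd.DataFrame({'GENUS': ['Lactobacillus', None]})
filled = df.fillna(value='missing')
print(filled)  # the None entry becomes 'missing'; real values are untouched
```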
# We'll change the dataframe in place so that we aren't making extra objects. It will return nothing in this case.
data_melt_split.fillna(value = ..., inplace = True)
data_melt_split.head()
concat() function

We now have our original data_melt DataFrame and the split column data_melt_split. We don't want all of either DataFrame but rather just the OTU information from data_melt and the last 5 columns of data_melt_split. Since we haven't sorted either DataFrame, the observations should still line up correctly. In that case we can just concatenate the columns using the pd.concat() function.
We'll specify the axis parameter to concatenate columns.
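In miniature (invented values), column-wise concatenation looks like this:

```python
import pandas as pd

# axis=1 glues the pieces together side by side, aligning on the row index
otus = pd.Series(['OTU_1', 'OTU_2'], name='OTU')
taxa = pd.DataFrame({'PHYLUM': ['Firmicutes', 'Proteobacteria']})
combined = pd.concat([otus, taxa], axis=1)
print(combined)
```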
# Concatenate portions of our dataframe by columns
data_melt_split_concat = pd.concat([data_melt['OTU'], data_melt_split.iloc[:, 1:]], ...)
# Check out the result
data_melt_split_concat.head()
Now that we have the information columns we desire, we can rename the columns using the columns property.
# Rename our columns
data_melt_split_concat.columns = ['OTU', 'PHYLUM', 'CLASS', 'ORDER', 'FAMILY', 'GENUS']
# Check the result
data_melt_split_concat.head()
Removing x__ prefixes from our values

Next, we need to remove the double underscores and any preceding character indicating the taxonomic rank. We'll use the re.sub() function to first get the regular expression working on a sample string.
re.sub(..., '', 'p__Actinobacteria')
replace() method

Our code works and in theory should cover all instances of the prefix, so now we can apply it to the DataFrame. To do so, we'll use the DataFrame.replace() method, which has the following relevant parameters:
to_replace: how to find the entries that will be replaced. This search can come in many forms including a numeric value, a string, or a regular expression. There are more replacement forms as well, found in the Pandas docs.
value: replace matching entries with this value
inplace: perform the operation in place if True and return None
regex: boolean for whether to interpret to_replace values as regular expressions
Overall, the replace() method looks like a DataFrame version of re.sub(), right? Let's give it a try.
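As a sketch on a throwaway DataFrame (values invented for illustration):

```python
import pandas as pd

# with regex=True, matching substrings are rewritten in every cell
df = pd.DataFrame({'PHYLUM': ['p__Firmicutes'], 'CLASS': ['c__Bacilli']})
cleaned = df.replace(r'^.__', '', regex=True)
print(cleaned)  # 'Firmicutes' and 'Bacilli'
```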
# Use the replace method to drop the prefix in our entries
# r'^.__' matches p__, c__, and so on for each tax rank. Replace with nothing
data_melt_split_concat.replace(r'^.__', '', ..., inplace=True)
# Take a look at the outcome
data_melt_split_concat.head()
All done! Is there a place in our steps where we could have optimized the process? Yes - as long as the data is well-formed/consistent, we can expand our regex pattern to treat both the ; and the x__ prefix as delimiters in the split step. Doing so saves us a call to replace().
# Use a better regex pattern for your split step!
data_melt['value'].str.split(..., expand=True)
Finally, let's read in sequences.tsv, a tab-separated value file (inspect it using spreadsheet software). This file contains two unnamed columns, one with the sequence name and another with the sequence itself. Our task is to convert each sequence header and sequence string into a format that resembles the fasta format: we'll add a > character to each header, replace the \t between header and sequence with a newline, and write the result back out.
Let's get to it!
# Import our pandas package
import pandas as pd
# Read in the sequences file as a DataFrame
# seq = pd.read_csv('data/sequences.tsv', sep = '\t', header=None)
seq = pd.read_csv(..., sep="\t", header=None)
seq
# Read in the file WITHOUT specifying a separator
seq = pd.read_csv('data/sequences.tsv', header=None)
# Take a look at it
seq.head()
# How big is it?
seq.info()
Adding the > character to each sequence line

You'll note that instead of separating our columns, we actually just created a single column with our header and sequence joined by the \t separator. Now we'll add '>' at the beginning of each sequence using simple string concatenation, which will broadcast to each value in our DataFrame.
# Concatenate the ">" to each value
seq = '>' + ...
# Look at the result
seq
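A minimal sketch of this broadcasting behavior on made-up data: adding a string to a DataFrame prepends it to every value.

```python
import pandas as pd

# Made-up single-column DataFrame mimicking the joined header/sequence rows
df = pd.DataFrame({0: ['header_1\tACGT', 'header_2\tTTGA']})
df = '>' + df  # the string broadcasts to each value
print(df[0].tolist())  # → ['>header_1\tACGT', '>header_2\tTTGA']
```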
Replacing the \t separator

Looking at the resulting output, we know that it's not quite there yet. Recall that the fasta format is:
>header_1
sequence_1
>header_2
sequence_2
This translates to essentially >header_1\nsequence_1, and we already know the right method for altering our DataFrame this way: replace().
import re
# Use the replace method
seq.replace(..., '\n', regex = True, inplace=True)
# Take a look at the result
seq
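On the same made-up data as before (a sketch, not the course file), replacing the embedded \t with \n yields the two-line fasta layout:

```python
import pandas as pd

df = pd.DataFrame({0: ['>header_1\tACGT']})
# regex=True lets replace() match the tab inside each value
df.replace('\t', '\n', regex=True, inplace=True)
print(df.iloc[0, 0])
# >header_1
# ACGT
```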
to_csv()

Recall we can save our DataFrame to disk using a number of available methods. We'll use the to_csv() method we learned about in lecture 03 to write seq to a file named sequences_line_break.fasta. Since there is only a single column, this will end up writing a fasta-like file with newline characters separating each entry.
# Remember to drop the index and headers when writing the data
seq.to_csv('data/sequences_line_break.fasta', index=..., header=...)
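To see why index=False and header=False matter, here is a minimal sketch writing a toy single-column DataFrame to an in-memory buffer instead of a file (the data is made up):

```python
import io
import pandas as pd

df = pd.DataFrame({0: ['alpha', 'beta']})
buf = io.StringIO()
# Without index=False and header=False, the row index and the
# column name '0' would also appear in the output
df.to_csv(buf, index=False, header=False)
print(buf.getvalue())  # one value per line, nothing else
```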
Use regex to bring sequences_line_break.fasta back to its original sequences.tsv format. Write the resulting data frame as seq_back.csv.
import re
seq_back = pd.read_csv('data/sequences_line_break.fasta', header=None)
# Replace \n by \t
seq_back.replace(..., ..., regex = True, inplace=True)
# Remove '>'
seq_back.replace(..., ..., regex=True, inplace=True)
# Show us the output before we write to disk
seq_back
# Write sequence_back.csv
import csv
seq_back.to_csv('data/seq_back.csv', index=False, header=False, quoting=csv.QUOTE_NONE, escapechar='\\')
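The round trip can also be sketched on a single made-up value, without touching disk. The two replace() calls mirror the challenge steps: \n back to \t, then drop the '>'.

```python
import pandas as pd

fasta = pd.DataFrame({0: ['>header_1\nACGT']})
# Undo the fasta formatting: '\n' back to '\t', then remove '>'
back = fasta.replace('\n', '\t', regex=True).replace('>', '', regex=True)
print(repr(back.iloc[0, 0]))  # → 'header_1\tACGT'
```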
And, that is it. Have fun playing with regex, and, for your own sanity, annotate your code!!!
That's our fifth class on Python! We worked through a lot of important functions used for regular expressions and string manipulation. To summarize, we've covered:
Regular expression syntax and pattern matching
Vectorized string methods such as str.split() with expand=True
The DataFrame.replace() method with regex patterns
Reading and writing files with read_csv() and to_csv()
Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 1-3 (Basic Concepts of String Manipulation, 1050 possible points; Formatting Strings, 1050 possible points; and Regular Expressions for Pattern Matching, 1400 possible points) from the Regular Expressions in Python course. This is a pass-fail assignment, and in order to pass you need to earn at least 2625 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapters, your total earned points for that chapter will be reduced.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll find this under the "Course Outline" menu at the top of the page for each course. It should look something like this, showing the total points earned in each chapter:
Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 13:59 on Thursday, July 29th (just before the start of that week's lecture) to submit your assignment.
Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
End-of-line (EOL, a.k.a. newline, line ending, line feed, or line break) characters are control characters that mark the end of a line and the start of a new one. Unix (Linux and Mac) uses a single linefeed character ("\n"), while Windows uses a carriage return followed by a linefeed ("\r\n", a.k.a. "CRLF"). You need to be careful when transferring files between Windows and Unix machines to make sure the line endings are translated properly. This is especially critical when you prepare scripts on a personal computer running Windows and then execute those scripts on a server running Linux. Shell programs, in particular, will fail in mysterious ways if they contain DOS line endings. On the Unix end, tools such as dos2unix and unix2dos let you interconvert between EOL formats. On a Windows machine, the conversion can be done using the more command: TYPE input_filename | MORE /P > output_filename.
# Remember that the "!" accesses the equivalent of our OS command prompt
! file data/sequences.tsv # ASCII is the linux convention
! file data/sequence_Windows.txt # ASCII text, with CRLF line terminators is the Windows convention
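A small sketch of the two conventions (the file names here are made up). Writing in binary mode preserves the exact bytes, while Python's text mode translates both endings to '\n' on read (universal newlines):

```python
# Binary mode preserves exact line endings; file names are made up
with open('unix_demo.txt', 'wb') as f:
    f.write(b'line 1\nline 2\n')        # Unix: LF only
with open('dos_demo.txt', 'wb') as f:
    f.write(b'line 1\r\nline 2\r\n')    # Windows: CRLF

# Text mode translates both endings to '\n' (universal newlines)
with open('unix_demo.txt') as u, open('dos_demo.txt') as d:
    print(u.read() == d.read())  # → True
```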
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.